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Abstract 

While there has been a significant amount of work studying gradient descent techniques for non-convex 
optimization problems over the last few years, all existing results establish either local convergence with 
good rates or global convergence with highly suboptimal rates, for many problems of interest. In this paper, 
we take the first step in getting the best of both worlds - establishing global convergence and obtaining 
a good rate of convergence for the problem of computing sqnareroot of a positive definite (PD) matrix, 
which is a widely studied problem in numerical linear algebra with applications in machine learning and 
statistics among others. 

Given a PD matrix M and a PD starting point Uo, we show that gradient descent with appropriately 
chosen step-size finds an e-accurate sqnareroot of M in O (alog(j|M — Uo^|g/e)) iterations, where 

a = (max{||Uo |i2 , IIMII2} / min{CT^in(Uo), cTmin Our result is the first to establish global 

convergence for this problem and that it is robust to errors in each iteration. A key contribution of our 
work is the general proof technique which we believe should further excite research in understanding 
deterministic and stochastic variants of simple non-convex gradient descent algorithms with good global 
convergence rates for other problems in machine learning and numerical linear algebra. 


1 Introduction 


Given that a large number of problems and frameworks in machine learning are non-convex optimization 


problems (examples include non-negative matrix factorization Lee and Seung 

2001 , sparse coding Aharon 

et al. 2006], matrix sensing Recht et al. 

2010 , matrix completion Koren et al. 

2009 , phase retrieval Ne- 


trapalli et al.[ 2015| etc.), in the last few years, there has been an increased interest in designing efficient 


non-convex optimization algorithms. Several recent works establish local convergence to the global optimum 
for problems such as matrix sensing [Jain et al., 2013 Tu et al. 2015 , matrix completion Jain and Netrapalli 


2014 , Sun and Luo 2015| , phase retrieval [Candes et al. 2015 , sparse coding Agarwal et al. 2013 and so on 
(and hence, require careful initialization). However, despite strong empirical evidence, none of these results 


have been able to establish global convergence. On the other hand some other recent works Nesterov and 
Polyak[ 2006, Ge et al. 2015, Lee et ah] 2016 Sun et al. 2015 establish the global convergence of gradient 


descent methods to local minima for a large class of non-convex problems but the results they obtain are 
quite suboptimal compared to the local convergence results mentioned above. In other words, results that 
have very good rates are only local (and results that are global do not have very good rates). 

Therefore, a natural and important question is if gradient descent actually has a good global convergence 
rate when applied to specific and important functions that are of interest in machine learning. Apart from 
theoretical implications, such a result is also important in practice since a) finding a good initialization might 
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be difficult and b) local convergence results are inherently difficult to extend to stochastic algorithms due to 
noise. 


In this work, we answer the above question in affirmative for the problem of computing square root of 
a positive definite (PD) matrix M: i.e., minu^o /(U) where /(U) = |jM — U^|||.. This problem in itself is 


a fundamental one and arises in several contexts such as computation of the matrix sign function 

|Higham 

2008 

, computation of data whitening matrices, signal processing applications [Kaminski et al. 

1971 

[Carlson 

1990 

Van Der Merwe and Wan 

2001 

Tippett et al. 

2003 and so on. 


1.1 Related work 


Given the importance of computing the matrix squareroot, there has been a tremendous amount of work in 


the numerical linear algebra community focused on this problem Bjorck and Hammarling 

1983 

Higham 

1986 

1987 

1997 

Meini 

2004 . For a detailed list of references, see Chapter 6 in Higham’s 

book 

Higham 


2008 


The basic component of most these algorithms is the Newton’s method to find the square root of a 
positive number. Given a positive number m and a positive starting point Ug, Newton’s method gives rise 
to the iteration 


1 / m\ 


( 1 ) 


It can be shown that the iterates converge to at a quadratic rate (i.e., e-accuracy in log log ^ iterations). 
The extension of this approach to the matrix case is not straight forward due to non commutativity of 
matrix multiplication. For instance, if M and Ut were matrices, it is not clear if ^ should be replaced 
by Ut"^M or MUt ^ or something else. One approach to overcome this issue is to select Uo carefully 
to ensure commutativity through all iterations Higham 1986, 1997 Meini 2004 , for example, Uq = M 


or Uo = I. However, commutativity is a brittle property and small numerical errors in an iteration itself 
can result in loss of commutativity. Although a lot of work since, has focused on designing stable iterations 
that are inspired by Eq. 0 E igham[ |1986[ |1997[ |Meini[ |2004| , and has succeeded in making it robust in 


practice, no provable robustness guarantees are known in the presence of repeated errors. Similarly, another 
recent approach by Sra| [2015| uses geometric optimization to solve the matrix squareroot problem but their 
analysis also does not address the stability or robustness to numerical or statistical errors (if we see a noisy 
version of M) . 

Another approach to solve the matrix square-root problem is to use the eigenvalue decomposition (EVD) 
and then take square-root of the eigenvalues. To the best of our knowledge, state-of-the-art computation 


complexity f or computing the EVD of a matrix (in the real arithmetic model of computation) is due to Pan 


et al. 


1998 


which is O (n“ log n-I-n log^ n log log i) for matrices with distinct eigenvalues. Though the 
result IS close to optimal (in reducing the EVD to matrix multiplication), the algorithm and the analysis are 
quite complicated. For instance robustness of these methods to errors is not well understood. As mentioned 
above however, our focus is to understand if local search techniques like gradient descent (which are often 
applied to several non-convex optimization procedures) indeed avoid saddle points and local minima, and 
can guide the solution to global optimum. 

As we mentioned earlier, Ge et al. 2015], Lee et al. [2016 give some recent results on global convergence 


for general non-convex problems which can be applied to matrix squareroot problem. While Lee et al. 


prove only asymptotic behavior of gradient descent without any rate, applying the result of Ge et al. 


2016 


2015 


gives us a runtime of O (n^°/poly (e ))Q which is highly suboptimal in terms of its dependence on n and 


Finally, we note that subsequent to this work, Jin et al. 2017 proved global convergence results with 


almost sharp dimension dependence for a much wider class of functions. While Jin et al. 2017 explicitly 


add perturbation to help escape saddle points, our framework does not require perturbation, and shows that 
for this problem, gradient descent naturally stays away from saddle points. 

proves convergence in the number of iteration of O (d^), with 


'^For optimization problem of dimension d, 


Ge et al. 


2015 


O (d) computation per iteration. In matrix squareroot problem d = v?, which gives total O (n'^®) dependence. 
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Method 

Runtime 

Global 

convergence 

Provable 

robustness 

Gradient descent (this paper) 

O {an^ logi) 

/ 

/ 

Stochastic gradient descent Ge 

O(ni7poly(e)) 

/ 

/ 

et al. 

2015 



Newton variants 

Higham 

2008 


O (n^ log log i) 

X 

X 

EVD (algebraic I 

’an et al. 

1998 

) 

O {n^ log n + n log^ n log log \) 

Not iterative 

X 

EVD (power method 

Golub and 

0{nHog\) 

Not iterative 

X 

Van Loar 

201 

2) 


Table 1: Comparison of our result to existing ones. Here uj is the matrix multiplication exponent and a is 
our convergence rate parameter defined in Eq.(|^. We show that our method enjoys global convergence and 
is also provably robust to arbitrary bounded errors in each iteration. In contrast, Newton variants only have 
local convergence and their robustness to errors in multiple iterations is not known. Robustness of methods 
based on eigenvalue decomposition is also not well understood. 


1.2 Our contribution 

In this paper, we propose doing gradient descent on the following non-convex formulation: 

min IlM-U^llp. (2) 

We show that if the starting point Uo is chosen to be a positive definite matrix, our algorithm converges 
to the global optimum of Eq. (H) at a geometric rate. In order to state our runtime, we make the following 
notation: 


A 
a = 


max(||Uo||^, V^IIMIIa) 

min ^(Tniin (Uq) > min (M)^ 


(3) 


where u-min (Uo) and ||Uo ||2 are the minimum singular value and operator norm respectively of the starting 
point Uo, and (Tmin (M) and ||M ||2 are those of M. Our result says that gradient descent converges e 


M-Uo^ 


iterations. Each iteration involves doing only 


close to the optimum of Eq. (§ in O ^Qflog 
three matrix multiplications and no inversions or leastsquares. So the t otal runtime of our algorithm is 
O [n^alog ^ where lo < 2.373 is the matrix multiplication exponent 


Williams 


2012 


. As a byproduct 

of our global convergence guarantee, we obtain the robustness of our algorithm to numerical errors for free. 
In particular, we show that our algorithm is robust to errors in multiple steps in the sense that if each step 
has an error of at most 6, then our algorithm achieves a limiting accuracy of O (a-\/||M||2i5). Another nice 
feature of our algorithm is that it is based purely on matrix multiplications, where as most existing methods 
require matrix inversion or solving a system of linear equations. An unsatisfactory part of our result however 
is the dependence on a > , where k is the condition number of M. We prove a lower bound of il (k) 

iterations for our method which tells us that the dependence on problem parameters in our result is not a 
weakness in our analysis. 

Outline: In Sectionwe will briefly set up the notation we will use in this paper. In Sectionwe will 
present our algorithm, approach and main results. We will present the proof of our main result in Section]^ 
and conclude in Section The proofs of remaining results can be found in the Appendix. 
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Algorithm 1 Gradient descent for matrix square root 
Input: M, PD matrix \Jq, rj,T 

Output: U 

for t = 0, • • • , r — 1 do 

Ut+i = Ut - ry (Ut^ - M) Ut - ryUt (Ut^ - M) 
end for 
Return Ut- 


2 Notation 

Let us briefly introduce the notation we will use in this paper. We use boldface lower case letters (v, w,...) 
to denote vectors and boldface upper case letters (M, X,...) to denote matrices. M denotes the input matrix 
we wish to compute the squareroot of. ai (A) denotes the singular value of A. (Tmin (A) denotes the 
smallest singular value of A. k, (A) denotes the condition number of A i.e., —^ without an argument 

<^min 

denotes k (M). A,; (A) denotes the largest eigenvalue of A and Amin (A) denotes the smallest eigenvalue 
of A. 


3 Our Results 


In this section, we present our guarantees and the high-level approach for the analysis of Algorithm which 
is just gradient descent on the non-convex optimization problem: 


min M — U 


|2 

If ■ 


(4) 


We first present a warmup analysis, where we assume that all the iterates of Algorithm commute with 
M. Later, in Section 3.2 we present our approach to analyze Algorithmfor any general starting point Uq. 
We provide formal guarantees in Section 3.3 


3.1 Warmup — Analysis with commutativity 

In this section, we will give a short proof of convergence for Algorithm [l] when we ensure that all iterates 
commute with M. 

Lemma 3.1. There exists a constant c such that if rj < o.iT'd Uq is chosen to be •y/||M][^ • I, then Ut 

in Algorithm^ satisfies: 


||Ut" - M||^ < exp (-277amin (M) t) HUo' - M||^ . 

Proof Since Uq = •\/||M][^I has the same eigenvectors as M, it can be seen by induction that Ut has the 
same eigenvectors as M for every t. Every singular value Ci (Ut+i) can be written as 

(Ut+i) = (l - 277 (Ut)' - cj, (M)) ) T, (Ut). (5) 

Firstly, this tells us that ||Ut ||2 < \/2 ||M ||2 for every t. Verifying this is easy using induction. The 
statement holds for t = 0 by hypothesis. Assuming it holds for Ut, the induction step follows by considering 
the two cases ||Ut II 2 < and ./\\Mf 2 < l|Ut|l 2 < a /2 IIMII 2 separately and using the assumption 

that 77 < p 3 ]p- A similar induction argument also tells us that crt (Ut) > Eq.@ can now be used 
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to yield the following convergence equation: 


a, (Ut+i)"-a, (M) 

(Ut)'-a,(M)| • (l- 477a,(Utf 
+Va,(Ut)' (a, (Ut)'-a,(M))) 


< 


a, (Ut)'-a,(M)| • (l- 477 a,(Utf 
+8?7V, (Ut)' IIMII 2 ) 

< (1 - 277U, (Ut)") |a, (Ut)" - cT, (M) 

< exp (-??crmin (M)) (Ti (Ut)^ - (Ti (M) , 

where we used the hypothesis on rj in the last two steps. Using induction gives us 

o'*(Ut)^-M < exp (-77(Tini„ (M) t) (Uq)^ - M . 
This can now be used to prove the lemma: 


|U,2 - M||^ = ^ (a, (Ut)" - a, (M))^ 


< exp {-2T]a^in (M) t) ^ {ai (Uq)^ - cr* (M) j 


<exp(-2?7gmin (M)t) ||Uo^ - M| 


F ■ 


□ 


Note that the above proof crucially used the fact that the eigenvectors of Ut and M are aligned, to reduce 
the matrix iterations to iterations only over the singular values. 


3.2 Approach 

As we begin to investigate the global convergence properties of Eq.Q, the above argument breaks down 
due to lack of alignment between the singular vectors of M and those of the iterates Ut. Let us now take a 
step back and consider non-convex optimization in general. There are two broad reasons why local search 
approaches fail for these problems. The first is the presence of local minima and the second is the presence 
of saddle points. Each of these presents different challenges: with local minima, local search approaches 
have no way of certifying whether the convergence point is a local minimum or global minimum; while with 
saddle points, if the iterates get close to a saddle point, the local neighborhood looks essentially flat and 
escaping the saddle point may take exponential time. 

The starting point of our work is the realization that the non-convex formulation of the matrix squareroot 
problem does not have any local minima. This can be argued using the continuity of the matrix squareroot 
function, and this statement is indeed true for many matrix factorization problems. The only issue to be 
contended with is the presence of saddle points. In order to overcome this issue, it suffices to show that the 
iterates of the algorithm never get too close to a saddle point. More concretely, while optimizing a function 
/ with iterates Ut, it suffices to show that for every t, Ut always stay in some region T) far from saddle 
points so that for all U, U' G T): 

IIV/(U)-V/(U')||^<L||U-U'||^ (6) 

||V/(U)||^ > V^(/(U)-/*), (7) 
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where /* = ininu/(U), and L and i are some constants. If we flatten matrix U to be n^-dimensional 
vector, then Eq.([^ is the standard smoothness assumption in optimization, and Eq.Q is known as gradient 


dominated property [Polyak 1963 Nesterov and Polyak 2006 . If Eq.([^ and Eq.Q hold, it follows from 
standard analysis that gradient descent with a step size ry < ^ achieves geometric convergence with 


/(Ut) -U< exp {-ii£t/2) (/(Uo) - U). 

Since the gradient in our setting is (Ut^ — M) Ut +Ut (Ut^ — M), in order to establish Eq.Q, it sufhces to 
lower bound Amin (Ut). Similarly, in order to establish Eq.Q, it suffices to upper bound ||Ut||2- Of course, 
we cannot hope to converge if we start from a saddle point. In particular Eq.Q will not hold for any I > 0. 
The core of our argument consists of Lemmas 4.3 and 4.2 which essentially establish Eq. (H and Eq.0 
respectively for the matrix squareroot problem Eq.(^, with the resulting parameters I and L dependent on 
the starting point Uq. Lemmas |4.3| and |4.2| accomplish this by proving upper and lower bounds respectively 
Utllj and Amin (Ut). The proofs of these lemmas use only elementary linear algebra and we believe such 


on 


results should be possible for many more matrix factorization problems. 


3.3 Guarantees 

In this section, we will present our main results establishing that gradient descent on @ converges to the 
matrix square root at a geometric rate and its robustness to errors in each iteration. 


3.3.1 Noiseless setting 

The following theorem establishes geometric convergence of Algorithm from a full rank initial point. 

Theorem 3.2. There exist universal numerical constants c and c such that if Uq is a PD matrix and 
rj < , then for every t € [T — 1], we have Ut he a PD matrix with 

||M-Ut"||^ <exp(-E77^2^) ||M-Uo'||^, 

where a and j3 are defined as 

f max(||Uo||^, v^llMII^) \ 

ymin (Uq) , min (M))y 

/3 = min (^CTmin (Uo) , a/CT min (M)^ 


Remarks: 


This result implies global geometric convergence. Choosing rj = in order to obtain an accuracy of 
e, the number of iterations required would he O [a log ^ 


• Note that saddle points of Q must be rank degenerate matrix (crmin(U) = 0) and starting Algorithmj^ 
from a point close to the rank degenerate surface takes a long time to get away from the saddle surface. 
Hence, as Uq gets close to being rank degenerate, convergence rate guaranteed by Theorem |3.2| degrades 
(as K (Uq)^). It is possible to obtain a smoother degradation with a finer analysis, but in the current 
paper, we trade off optimal results for a simple analysis. 


The convergence rate guaranteed by Theorem 3.2 also depends on the relative scales of Uo and M (say 
as measured by IIU 0 II 2 / IIMII 2 ) and is best if it is close to 1. 


• We believe that it is possible to extend our analysis to the case where M is low rank (PSD). In this 
case, suppose rank(M) = fc, and let U* be the k-dimensional subspace in which M resides. Then, 
saddle points should satisfy CTfc(U^U*) = 0. 
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A simple corollary of this result is when we ch oose Uq = AI, where ||M ||2 < A < 2 |1M||2 (such a A can be 
found in time O (n^) 


Musco and Musco 


2015 


)• 


Corollary 3.3. Suppose we choose Uo = AI, where ||M ||2 < A < 2||M||2. Then ||M —Ut^ 


< e for 


3.3.2 Noise Stability 

Theorem |3.2| assumes that the gradient descent updates are performed with out any error. This is not 
practical. For instance, any implementation of Algorithm would incur rounding errors. Our next result 
addresses this issue by showing that Algorithm is stable in the presence of small, arbitrary errors in each 
iteration. This will establish the stability of our algorithm in the presence of round-off errors for instance. 
Formally, we consider in every gradient step, we incur an error At. 

The following theorem shows that as long as the errors At are small enough. Algorithm recovers the 
true squareroot upto an accuracy of the error floor. The proof of the theorem follows fairly easily from that 
of Theorem 13.21 

Theorem 3.4. There exist universal numerical constants c and c such that the following holds: Suppose Uq 
is a PD matrix and p < where a and j3 are defined as before: 


^ A I max(||Uo|| 2 , v^||M|| 2 ) \ 

ymin (Uq) , \/tT min (M))y 

/3 = min (Uq) , CTmin (M)^ . 

Suppose further that ||At ||2 < ;^?7CTmin (M)/3. Then, for every t G [T— 1], we have Ut he a PD matrix with 

||M-Ut"||^ <exp(-F7?/32t) ||M-Uo'||^ 

t-i 

+ 4max(|lUo|i2 , ^/sl^) ^ || A,||^ . 

s=0 


Remarks: 

• Since the errors above are multiplied by a decreasing sequence, they can be bounded to obtain a 
limiting accuracy of O (a(||Uo II2 + \/||M|| 2 )(sup, IIAsIIp)). 

• If there is error in only the first iteration i.e.. At = 0 for t ^ 0, then the initial error Aq is attenuated 
with every iteration. 


|M - Ut^ll^ < exp {-cppH) ||M - Uo^ 


-H6max(||Uo|l2,||M||2)e 




IIAol 


F ■ 


That is, our dependence on ||Ao|jp, is exponentially decaying with respect to time t. On the contrary, 
best known results only guarantees the error dependence on ||Ao||p will not increase significantly with 
respect to time t Higham[ 2008]. 


3.3.3 Lower Bound 

We also prove the following lower bound showing that gradient descent with a fixed step size requires O (k) 
iterations to achieve an error of O ((Tmin (M)). 
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Theorem 3.5. For any value of k, we can find a matrix M such that, for any step size rj, there exists an ini¬ 
tialization Uo that has the same eigenvectors as M, with ||Uo II2 < \/3 IIMII2 andamin (Uq) > -^^/cr min (M), 
such that we will have ||Ut^ — M 11^ > jO-min (M) for all t<K. 


This lemma shows that the convergence rate of gradient descent fundamentally depends on the condition 
number k, even if we start with a matrix that has the same eigenvectors and similar scale as M. In this case, 
note that the lower bound of Theorem 3.5 is off from the upper bound of Theorem 3.2 by t/K. Though we 
do not elaborate in this paper, it is possible to formally show that a dependence is the best bound 

possible using our argument (i.e., one along the lines of Section 3.2). 


4 Proof Sketch for Theorem 3.2 


In this section, we will present the proof of Theorem 3.2 To make our strategy more concrete and transparent, 
we will leave the full proofs of some technical lemmas in Appendix [A| 

At a high level, our framework consists of following three steps: 


1 . Show all bad stationary points lie in a measure zero set {U|(/ii(U) = 0} for some constructed potential 
function </)(•). In this paper, for the matrix squareroot problem, we choose the potential function (^(•) 
to be the smallest singular value function tTmin (•)■ 

2. Prove for any e > 0, if initial Uq S = {U| |(/!)(U)| > e} and the stepsize is chosen appropriately, 
then we have all iterates Ut S T>^. That is, updates will always keep away from bad stationary points. 

3. Inside regions show that the optimization function satisfies good properties such as smoothness 
and gradient-dominance, which establishes convergence to a global minimum with good rate. 


Since we can make e arbitrarily small and since {U|(/ii(U) = 0} is a measure zero set, this essentially establishes 
convergence from a (Lebesgue) measure one set, proving global convergence. 

We note that step 2 above implies that no stationary point found in the set {U|0(U) = 0} is a local 
minimum - it must either be a saddle point or a local maximum. This is because starting at any point 
outside {U|(/>(U) = 0} does not converge to {U|(/)(U) = 0}. Therefore, our framework can be mostly used 
for non-convex problems with saddle points but no spurious local minima. 

Before we proceed with the full proof, we will first illustrate the three steps above for a simple, special 
case where n = 2 and all relevant matrices are diagonal. Specifically, we choose target matrix M and 
parameterize U as: 


M = 




Here x and y are unknown parameters. Since we are concerned with U ^ 0, we see that x,y > 0. The reason 
we restrict ourselves to diagonal matrices is so that the parameter space is two dimensional letting us give a 
good visual representation of the parameter space. Figures and show the plots of function value contours 
and negative gradient flow respectively as a function of x and y. 

We will use Figures and to qualitatively establish the three steps in our framework. 

1. From Figure]^ we note that (2, v^) is the global minimum. (2, 0), (0, v^) are saddle points, while 
(0, 0) is local maximum. We notice all the stationary points which are not global minima lie on the 
surface o'min(U) = 0, that is, the union of x-axis and y-axis. 

2. By defining a boundary {U|(Tmin(U) > c, ||U ||2 < C} for some small c and large C (corresponding to 
the red box in Figure]^, we see that negative gradient flow is pointed inside the box which means that 
for any point in the box, performing gradient descent with a small enough stepsize will ensure that all 
iterates lie inside the box (and hence keep away from saddle points). 











Figure 1: Contour of Objective Functions 


Figure 2: Flow of Negative Gradient 


3. Inside the red box, Figurej^shows that negative gradient flow points to the global optimum. Moreover, 
we can indeed establish upper and lower bounds on the magnitude of gradients within the red box - 
this corresponds to establishing smoothness and gradient dominance respectively. 

Together, all the above observations along with standard results in optimization tell us that gradient 
descent has geometric convergence for this problem. 

We now present a formal proof of our result. 


4.1 Location of Saddle Points 

We first give a characterization of locations of all the stationary points which are not global minima. 

Lemma 4.1. Within symmetric PSD cone {U|U ^ 0}, all stationary points of /(U) = ||M — U^||p which 
are not global minima, must satisfy (Tniin (U) = 0. 

Proof. For any stationary point U' of /(U) which is not on the boundary {U|(Tmin(U) = 0}, by linear algebra 
calculation, we have: 

0 =||V/(U')||^ = ||(U'2 - M)U' + U'(U'2 - M)\\% 

= ((U'2 - M)U' + U'(U'2 - M), 

(U'2 _ M)U' + U'(U'2 - M)) 

=2Tr([(U'2 - M)U']2) + 2Tr((U'2 - 
>4nii„(U')||U'2-M||| 

Therefore, since U' is not on the boundary of PSD cone, we have cr^i,j(U) > 0, which gives /(U') = 
||M — U'^ll^ yf 0, thus U' is global minima. □ 

As mentioned before, note that all the bad stationary points are contained in {U|CTniin (U) = 0} which is 
a (Lebesgue) measure zero set. 


9 





































4.2 Stay Away from Saddle Surface 

Since the gradient at stationary points is zero, gradient descent can never converge to a global minimum if 
starting from suboptimal stationary points. Fortunately, in our case, gradient updates will keep away from 
bad stationary points. As in next Lemma, we show that as long as we choose suitable small learning rate, 
CTmin (Ut) will never be too small. 


Lemma 4.2. Suppose f] < c 


min (Uo ), (M) / lo) 


(||Uo||3,(3||M||2) 


3 / 2 \^> where c is a small enough constant. Then, for every 


t € [T — 1], we have Ut in Algorithm^ he a PD matrix with 


'^min (Ut) > min 



a/CT min (M) 


10 


It turns out that the gradient updates will not only keep tTmin (U) from being too small, but also keep 
IIUII 2 from being too large. 


Lemma 4.3. Suppose rj < 


_ 1 _ 

10max(||Uo||^,3||M||2) ' 


For every t S [T — 1], we have Ut in 




satisfying: 



Although IIUII 2 is not directly related to the surface with bad stationary points, the upper bound on 
IUII 2 is crucial for the smoothness of function /(•), which gives good convergence rate in Section 4.3 


4.3 Convergence in Saddle-Free Region 

So far, we have been able to establish both upper bounds and lower bounds on singular values of all iterates 
Ut given suitable small learning rate. Next, we show that when spectral norm of U is small, function /(U) 
is smooth, and when CTmin (U) is large, function /(U) is gradient dominated. 

Lemma 4.4. Function/(U) = ||M — U^||^ is 8maji{r, HMW^j-smooth in region {UWllJW^ < T}. That is, 
for any Ui,U 2 G {U| ||U ||2 < F}, we have: 

l|V/(Ui) - V/(U 2 )||f < 8 max{F, ||M|j 2 }||Ui - U 2 IIF 

Lemma 4.5. Function /(U) = ||M — U^||^ is A'y-gradient dominated in region {UjCTniin (U)^ > 7 }. That 
is, for any U G {U|(Tmin (U)^ > 7 }, we have: 

||V/(U)|||>47/(U) 

Lemma 4.4 and 4.5 are the formal versions of Eq.Q and Eq. 0 in Section |3.2[ which are essential in 
establishing geometric convergence. 

Putting all pieces together, we are now ready prove our main theorem: 


Proof of Theorem \3.^ Recall the definitions in Theorem |3.2| 

max(||Uo||2, V/IIMII 2 ) 


A 
a = 


min ^(7min (Uo) J min (M)^ 
/3 = min CTmin (Uo) , a/ CTmin (M)^ 
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By choosing learning rate rj < with small enough constant c. We can satisfy the precondition of 
Lemma |4.2[ and Lemma |4.3| at same time. Therefore, we know all iterates will fall in region: 


U 


U||2<max('||Uo||2,y3||M||, 


Amin (U) > min CTmin (Uq) , 


x/o-min (M) 


10 


Then, apply Lemma 
smoothness parameter: 


4.4 


and Lemma 


4.5 


we know in this region, function /(U) = 


U^-M ; has 

I Mr' 


8 max I max {||Uo||^3||M||2},||M||2}< 240^/3^2 
and gradient dominance parameter: 


4 min 



(Tmin (M) 
100 



That is, /(U) in the region is both 2^0 ^/^and /3^/25-gradient dominated. 
Finally, by Taylor’s expansion of smooth function, we have: 


/(Ut+i) </(Ut) + (V/(Ut),Ut+i - Ut) 

=/(Ut) - (77 - ||V/(Ut)|l^ 

</(Ut)-|llV/(Ut)|!?. 

<(i- 4 )/(u.) 

The second last inequality is again by setting constant c in learning rate to be small enough, and the last 
inequality is by the property of gradient dominated. This finishes the proof. 

□ 


5 Conclusion 

In this paper, we take a first step towards addressing the large gap between local convergence results with 
good convergence rates and global convergence results with highly suboptimal convergence rates. We consider 
the problem of computing the squareroot of a PD matrix, which is a widely studied problem in numerical 
linear algebra, and show that non-convex gradient descent achieves global geometric convergence with a 
good rate. In addition, our analysis also establishes the stability of this method to numerical errors. We 
note that this is the first method to have provable robustness to numerical errors for this problem and our 
result illustrates that global convergence results are also useful in practice since they might shed light on the 
stability of optimization methods. 

Our result shows that even in the presence of a large saddle point surface, gradient descent might be 
able to avoid it and converge to the global optimum at a linear rate. We believe that our framework and 
proof techniques should be applicable for several other nonconvex problems (especially those based on matrix 
factorization) in machine learning and numerical linear algebra and would lead to the analysis of gradient 
descent and stochastic gradient descent in a transparent way while also addressing key issues like robustness 
to noise or numerical errors. 
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A Proofs of Lemmas in Section [4] 


In this section, we restate the Lemmas in Sectionwhich were used to prove Theorem |3. 2 [ and present their 
proofs. 

First, we prove lemmas which show a lower bound and a upper bound on the eigenvalues of the inter¬ 
mediate matrices Ut in Algorithm [l] This shows Ut always stay away from the surface where unwanted 
stationary point locate. 


Lemma A.l (Restatement of Lemma 


enough constant. Then, for every t S [T — 


4.2). Suppose rj < 


cmm cr„i„ 


(Uo)V<^„i„(M)/lo) 


nax(||Uo||i.(3||M||2)"^"') 

I, we have Ut be a PD matrix with 


where c is a small 


'^min (Ut) ^ min I (Jmin (Uq) j 


a/ (Tinin (M) 


10 


Proof. We will prove the lemma by induction. The base case t = 0 holds trivially. Suppose the lemma holds 
for some t. We will now prove that it holds for t -|- 1. We have 

Aniin (Ut+i) =A„,i„ (Ut - g (Ut' - M) Ut - (Ut' - M)) 


>An 


Ut - 2r,Vt^ + A^i„ -Ut + r;(MUt + UtM) 


1 


=An,i„ -Ut - 277Ut" + A„,i„ -I -b r;M Ut -I + 77 M - r^^MUtM 


1 . 


( 8 ) 


When 77 < 


100max(||Uo||^.3||M||2) 


using Lemma 


4.3 


we can bound the first term as 


Amin yUt - 277Ut^ = T^min (Ut) - 2T]a^in (Ut)^ 


(9) 


To bound the second term, for any vector w S K" with ||w |]2 = 1, let w = where is the i 


th 


eigenvector of M, and = 1- Then: 


1 , 


w ' -I -b r/M Ut -I -b 7?M - r^^MUtM w 


>Amin (Ut) 


-I -b T^M ) W 


-ry'IIUtll^llMwII 


n y ^ \ 2 n 

= Amin (Ut) E 2 + 'I 2 E (M)' 

n y 

= Amin (Ut) E 4 + (Ut) A (M) 

(Ut) E a? (^ + W* (M) + 77 V. (M)' - (Ut) a. (M)'] 

(Ut) E “i (^ + (^)) - (Ut) Q + ^Wmin (M) 


( 6 ) 

— i^min 


(C2) 

> cr. 


( 10 ) 


where (Ci) is due to the fact that Ut is a PD matrix, so Amin (Ut) = CTmin (Ut) > 0, and {( 2 ) is because since 

min ( (Tuiin (Uq ) /lO ) -1 i O /-wr \ /n ir\‘2 1 /■» ir\ 

77 < - ^^-7 -—- ' < 75 —nTXiTiOfTr J we have n^K (Ut) (Ji (M) < fnai (M). 

max(||Uo||^.(3||M||2)^^") “ 2K(Ut)||M||2 ’ ' V t7 7 -2' ) 
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Plugging Eq.([^ and Eq.(lO) into Eq.([^, we have: 


Amin (Ut+l) > CTmin (Ut) (1 + “Wmin (M) - 2?7(Tmin (Ut) 


When CTmin (Ut) < \/anun (M)/3, we obtain: 


Amin (Ut+l) ^ f^min (Ut) ^ niax ^fTmin (Uq) 

and when Umin (Ut) > a/ Cmin (M)/3, we have: 

•^min (Ut+i) (Ut) (1 - 2r?crmin (Ut)^) 


^CTmin (M) 


10 


> CTmin (Ut) (1 - 2r] ||Ut|l 2 ) ^ fo’inin (Uq) , 


This concludes the proof. 


□ 


Lemma A. 2 (Restatement of Lemma 4.3). Suppose rj < — - . / ^ ^ every f S [T — 1], we 

10 max{^||Uo||2 ,3||M||2 j 

have: 


Ut||2<max('||Uo||2,73|!M|| 


Proof. We will prove the lemma by induction. The base case t = 0 is trivially true. Supposing the statement 
is true for Ut, we will prove it for Ut+i. 

Using the update equation of Algorithm [T} we have: 

||Ut+i||2 = ||Ut-r?(Ut'-M) Ut-iyUt (Ut'-M)||2 
= II (I - 277Ut^) Ut + ?7MUt + JyUtMlI^ 

< ||(I-2T7Ut2)Ut||2+2r7||M||2||Ut||2. (11) 


The singular values of the matrix (I — 2?7Ut^) Ut are exactly (1 — 2r]a^) ■ a where cr is a singular value 
of Ut. For CT < a/2 IIMII 2 , we clearly have (1 — 2pa‘^)a < ^2 |jM|| 2 . On the other hand, for a > a/2 ||M|| 2 , 
we have (1 — 2r]a‘^)a < (1 — dry ||M|| 2 )ct. Plugging this observation into Eq.( 11), we obtain: 


|U 


t+l|l2 


< max (^y2||M|l2, (1 - dry \\M\\^) HUtl/j + 2ry \\M\\^ HUtl/ 

< max (^^ 2 |lM|j 2 + ^ |lUt ||2 , ||Ut|| 2 ) < max (^HUol/ , 
where we used the inductive hypothesis in the last step. This proves the lemma. 


□ 


Finally, we prove the smoothness and gradient dominance in above regions. 


Lemma A .3 (Restatement of Lemma 4.4). For any Ui,U2 S {U| ||U|j 2 < T}, we have function /(U) = 


M- 


satisfying: 


||V/(Ui) - V/(U2)||+ < 8max{r, HMI/jHUi - U2IIF 


( 12 ) 
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Proof. By expanding gradient V/(U), and reordering terms, we have: 


||V/(Ui)-V/(U2)||f 

= ||(2Ui 3 - MUi - UiM) - (2U2^ - MU 2 - U2M)|jF 
= ||2 (Ui3 _ U 2 ') - M(Ui - U 2 ) - (Ui - U2)M||^ 

<2||M||2||Ui - U 2 IIF + 2||Ui3 _ U2 '||f 

=2||M||2||Ui - U 2 IIF + 2||Ui2(Ui - U 2 ) + Ui(Ui - U2)U2 + (Ui - U2)U2 '||f 

< 2 ||m|| 2 ||Ui - U 2 IIF + eriiUi - U 2 ||f 

<8max{r,|jM||2}||Ui-U2||F 


□ 


Lemma A. 4 (Restatement of Lemma 
||M — U^llp satisfying: 

||V/(U)|i|>47/(U) (13) 

Proof. By expanding gradient V/(U), we have: 


4.5). For any U G {U|cri„in (U) > 7 }, we have function /(U) = 


||V/(U)||| = ||(U2 - M)U + U(U2 - M)||| 

= ((U2 - M)U + U(U2 - M), (U^ - M)U + U(U2 - M)) 
>4a^i„(U)||U2-M|l|. = 47/(U) 


□ 


B Proof of Theorem 3.4 


In this section, we will prove Theorem 3.4 We first state a useful lemma which is a stronger version of 
Lemma K2\ 


Suppose further that n < where c is a small enough constant and denote 

max(||Uo||2,-y/3||M||2)3/2 

Ut - 77 (Ut" - M) Ut - pVt (Ut" - M) . Then, Ut+i is a PD matrix with: 


Lemma B.l. Suppose Ut is a PD matrix with ||Ut |j 2 < iiiax(||Uo|i 2 > \/3~||M][^), and CTmin (Ut) > min(crniin (Uo), \/ Cmin (1 

Ut+i = 


Amin (Ut+l) - (^1 + ^P'min^(M) ^ ^ ^ VcTmin (M)). 


Indeed, our proof of Lemma |4 .2 1 already proves this stronger result. Now we are ready to prove Theorem 


Proof of Theorem \3.4\ The proof of the theorem is a fairly straight forward modification of the proof of 
Theorem |3.2| We will be terse since for most part we will use the arguments employed in the proofs of 
Theorem 13.21 and Lemmas 14.31 and 14.21 

We have the following two claims, which are robust versions of Lemmas |4.3| and [4 .2 [ bounding the spectral 
norm and smallest eigenvalue of intermediate iterates. The proofs will be provided after the proof of the 
theorem. 

Claim B.2. For every t G [T — 1], we have: 


||Ut||2<max(||Uo||2 V3IIMII2). 
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Claim B.3. For every t G [T — 1], we have Ut he a PD matrix with 

(f-r \ ^ • / /tt ^ '\/'^min (M) 

^min (Ut) ^ (Uq) ; * " ’ ^ “)■ 

We prove the theorem by induction. The base case t = 0 holds trivially. Assuming the theorem is true 
for t, we will show it for t + 1. Denoting Ut+i = Ut — rj (Ut^ — M) Ut — ??Ut (Ut^ — M), we have 


|M-Ut+i2|L = 


< 


M - - Ut+iAt - AtUt+i - At 


M-Ut%i 


U. 


t+i 


Atllp' + II 


(14) 


Using Claims [B?^ and p3.3[ Lemma [4)3| tells us that 


U. 


t+i 


< max(||Uo||2,t/3|lM||2) 


and Theorem 13.21 tells us that 


M-UtV 


< exp (^-c77min(CTmi„ (Uq)^ ,crj„in (M))^ ||M - Ut" 


Plugging the above two conclusions into (14), tells us that 
211 


M- U 


t+i 


< exp (^-C77min((Ti„in (Uq)^ ,cri„in (M))^ ||M- Ut^||^ + 2 max(||Uo ||2 , y^3 ||M|| 2 ) ||A 


t|lF 


■^Wmin (M) min( <^min (Uo) , min (M))||At||^ 


< exp (^-C77min((Tinin (Uo)^ ,crmin (M))^ ||M-Uo^||^ 

t 

+ 4 max(||Uo ||2 , ^3 IIMII 2 ) ^ exp (^-CTymin((Tmin (Uq)^ , CTmin (M))(t - s)) || A 


s II ^ ; 


s=0 


where we used the induction hypothesis in the last step. 
We now prove Claim [B^ 


□ 


Proof of Claim \B7^ Just as in the proof of Lemma |4.3[ we will use induction. Assuming the claim is true 
for Ut, by update equation 


Ut+i = Ut - 7? (Ut" - M) Ut - ryUt (Ut" - M) + At 


We can write out: 


|Ut+i ||2 < II (I - 2r?Ut") Ut ||2 + r? ||M ||2 HUtH^ + V llUtH^ l|M ||2 + ||A 


t|l 2 • 


(15) 


Since 77 < ioniax(||Uo||^,||M||,;) l|Ut|l 2 < max(||Uo ||2 , \/3 |!M|| 2 ), note that the singular values of the 
matrix (l — 277Ut") Ut are exactly (1 — 2r]a‘^) ■ a where cr is a singular value of Ut. For cr < 1^/2 ||M|| 2 , 
we clearly have (1 — 2ria^)a < \J2 ||M|| 2 . On the other hand, for o > \J2 ||M|| 2 , we have (1 — 2r]a^)a < 
(1 — 477 ||M|| 2 )cr. Plugging this observation into (15), we obtain: 


||Ut+i ||2 < max (^^ 2 ||M|| 2 , (1 - 4 r 7 IIMII 2 ) HUtH^ j + 2r, ||M ||2 ||Ut ||2 + || At| 
< max(||Uo ||2 , y 3 ||M|| 2 ), 

proving the claim. 


□ 
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We now prove Claim B.3 


Proof of Claim \B7^ We will use induction, with the proof following fairly easily using Lemma |B.1| Suppose 
o-min (Ut) > min(cri„in (Uo), (M)). Denoting 


Ut+i = Ut - 7? (Ut^ - M) Ut - TyUt (Ut^ - M) , 

Lemma rB.il tells us that 

CTmin (Ut+l) - + TyO-mm^CM) ^ ^ ^ y/cTmin (^), 

which then implies the claim, since 

^min (Ut+l) ^ fT'min (Ut+l) ||At||2 A mill(fJmin (Uq) , V^^niin 

□ 


C Proof of Theorem 3.5 


In this section, we will prove Theorem |3.5[ 
Proof. Consider two-dimensional case, where 

M = 


|M||, 


0 


0 CTmin (M) 


We will prove Theorem 3.5 by considering two cases of step size (where rj > or ry < yp^jip) 

separately. 


Case 1 : For step size ry > Let /3 = 


2r,||M||, 


1 , and consider following initialization Uq: 


Uo = 


0 

y 0 a/ow7(^^ 


'2(/3-1)^/3 |!M|i^ 0^ 


Since ry > we know /3 < 3, which satisfies our assumption about Uo and M. By calculation, we have: 

Uo(Uo"-M) + (Uo2-M)Uo = 
and since 2r]{/3 — 1) ||M ||2 = 1, we have: 

Ui = Uo - ry[Uo(Uo' - M) + (Uq" - M)Uo] = ~ ^ i] ^ (l) 1 


Then, by induction we can easily show for all t > 1, Ut = Ui, thus ||Ut^ — Mil > HMj^ > jCTmin (M) . 
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Case 2 : For step size rj < consider following initialization Uq: 

VWf2 0 ^ 

0 |\/omin (M) j 

According to the update rule in Algorithm we can easily show by induction that: for any t > 0, 
form: 

VWf2 0 ^ 

0 at^dmin (M) j 

where at is a factor that depends on t, satisfying 0 < Ot < 1 and: 

Ot+l = Q:t[l + Wmin (M) (1 - Oj)], «0 = ^ 

Since r] < we know: 

Ot+l 

Therefore, for all t < k, we have a* < | 


<o.[l + -(l-<>?)]<», + - 
, and thus ||Ut^ - M||^ > iomin (M). 




Ut is of 


□ 
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